install.packages("tidyverse")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.1/tidyverse_1.3.1.tgz'
Content type 'application/x-gzip' length 421072 bytes (411 KB)
==================================================
downloaded 411 KB
The downloaded binary packages are in
/var/folders/lr/hrpr063x28jd67wnc08d040w0000gn/T//RtmpgikMqG/downloaded_packages
install.packages("nycflights13")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.1/nycflights13_1.0.2.tgz'
Content type 'application/x-gzip' length 4502373 bytes (4.3 MB)
==================================================
downloaded 4.3 MB
The downloaded binary packages are in
/var/folders/lr/hrpr063x28jd67wnc08d040w0000gn/T//RtmpgikMqG/downloaded_packages
library(nycflights13)
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4
✓ tibble 3.1.3 ✓ dplyr 1.0.7
✓ tidyr 1.1.3 ✓ stringr 1.4.0
✓ readr 2.0.1 ✓ forcats 0.5.1
── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Using pipe (%>%) to simplify your code:
Now, suppose we are trying to generate a new tibble that contains all flights in July and adds a new column, reader = “John”.
flights
(flights_July = filter(flights, month == 7))
(flights_July_ZY = mutate(flights_July, reader="John"))
Let’s use the pipe to simplify this code in three steps:
1. Remove the names of the intermediate tibbles.
2. Remove the first argument of each verb.
3. Connect the steps with “%>%”.
flights %>%
filter(month == 7) %>%
mutate(reader="John")
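To check that the piped form really matches the step-by-step form, here is a minimal sketch on a tiny made-up tibble. The name toy_flights and its values are hypothetical and only stand in for flights.

```r
library(dplyr)

# Hypothetical miniature stand-in for `flights` (values are made up).
toy_flights <- tibble(
  month = c(6, 7, 7, 8),
  dest  = c("ORD", "LAX", "ORD", "SFO")
)

# Step-by-step version with named intermediate tibbles.
toy_july        <- filter(toy_flights, month == 7)
toy_july_reader <- mutate(toy_july, reader = "John")

# Piped version: no intermediate names, no explicit first argument.
toy_piped <- toy_flights %>%
  filter(month == 7) %>%
  mutate(reader = "John")

# Compare the two results.
all.equal(toy_july_reader, toy_piped)
```

The pipe passes the tibble on its left as the first argument of the verb on its right, which is why the first argument can be dropped.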
Group Challenge: Investigate the relationship between distance and average arrival delay for each destination in July. Hint: use summarise, group_by, filter, and arrange.
flights %>%
filter(month == 7) %>%
group_by(dest) %>%
summarise(dist = mean(distance, na.rm= TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(delay))
install.packages("covid19.analytics")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.1/covid19.analytics_2.1.tgz'
Content type 'application/x-gzip' length 3562127 bytes (3.4 MB)
==================================================
downloaded 3.4 MB
The downloaded binary packages are in
/var/folders/lr/hrpr063x28jd67wnc08d040w0000gn/T//RtmpgikMqG/downloaded_packages
library(covid19.analytics)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
library(tidyverse)
# obtain time series data for "confirmed" cases
(covid19.confirmed.cases <- covid19.data("ts-confirmed"))
Data being read from JHU/CCSE repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
Data retrieved on 2021-09-07 14:10:59 || Range of dates on data: 2020-01-22--2021-09-06 | Nbr of records: 279
--------------------------------------------------------------------------------
# First check the number of variables (the data has one column per date). difftime() gives the number of days or weeks between two dates.
difftime("2021-9-6", "2020-1-22", units = "days")
Time difference of 592.9583 days
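The fractional result above most likely comes from difftime() parsing the character strings as local date-times, so a daylight-saving shift between the two dates removes one hour. A minimal sketch that avoids this, assuming whole calendar days are what we want, converts to Date first. The names gap_days and gap_weeks are only illustrative.

```r
# Date objects carry no time of day or time zone,
# so the difference is a whole number of days.
gap_days <- as.Date("2021-09-06") - as.Date("2020-01-22")
gap_days
# Time difference of 593 days

# The same gap expressed in weeks.
gap_weeks <- difftime(as.Date("2021-09-06"), as.Date("2020-01-22"), units = "weeks")
gap_weeks
```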
Individual exercise (3 mins)
# Find the top ten countries that have the largest number of confirmed cases on 2021-09-02 (recommend using pipe!)
# Hint-1: the following verbs are recommended: select, group_by, summarise, filter
# Hint-2: filter(rank(desc(X)) <= 10) will help you identify the top ten values of X
covid19.confirmed.cases %>%
select(Country.Region, `2021-09-02`) %>%
group_by(Country.Region) %>%
summarise(case_sum_02 = sum(`2021-09-02`, na.rm = TRUE)) %>%
filter(rank(desc(case_sum_02)) <= 10 )
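To see what the rank(desc(X)) filter does, here is a small sketch on made-up numbers. The tibble toy_cases is hypothetical. It also shows slice_max(), a dplyr alternative that returns the rows already sorted.

```r
library(dplyr)

# Hypothetical case counts for five made-up countries.
toy_cases <- tibble(
  country = c("A", "B", "C", "D", "E"),
  cases   = c(50, 400, 120, 900, 10)
)

# rank(desc(cases)) gives rank 1 to the largest value;
# filter() keeps the top three but preserves the original row order.
top3_rank <- toy_cases %>%
  filter(rank(desc(cases)) <= 3)

# slice_max() keeps the same rows, ordered from largest to smallest.
top3_slice <- toy_cases %>%
  slice_max(cases, n = 3)

top3_rank
top3_slice
```

Because filter() does not reorder rows, adding arrange(desc(...)) after it is a simple way to read the top ten in order.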
Individual exercise (5 mins): how do we define the daily growth rate?
# Adapt the above code to find the top ten countries that have the highest daily growth rate of confirmed cases on 2021-09-02.
# Hint-1: define a new variable, growth_rate, either within summarise() or using mutate()
# Hint-2: daily growth rate = (cases_day_2 - cases_day_1) / cases_day_1
covid19.confirmed.cases %>%
select(Country.Region, `2021-09-02`,`2021-09-01` ) %>%
group_by(Country.Region) %>%
summarise(case_sum_02 = sum(`2021-09-02`, na.rm = TRUE), case_sum_01 = sum(`2021-09-01`, na.rm = TRUE)) %>%
mutate(growth_rate = (case_sum_02 - case_sum_01) / case_sum_01 ) %>%
filter(rank(desc(growth_rate)) <= 10 )
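The growth rate in the code above is measured relative to the earlier day, so a country with zero cases on day 1 produces a division by zero. The sketch below, on a hypothetical tibble toy_counts, drops such rows before ranking. Whether to drop them or treat them separately is a judgment call, not something the exercise prescribes.

```r
library(dplyr)

# Hypothetical two-day totals for three made-up countries.
toy_counts <- tibble(
  Country.Region = c("A", "B", "C"),
  case_sum_01    = c(100, 0, 250),
  case_sum_02    = c(110, 5, 250)
)

toy_growth <- toy_counts %>%
  # Country B has no cases on day 1, so its growth rate would be Inf.
  filter(case_sum_01 > 0) %>%
  mutate(growth_rate = (case_sum_02 - case_sum_01) / case_sum_01) %>%
  arrange(desc(growth_rate))

toy_growth
```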
(CV_US = covid19.US.data() )
Data being read from JHU/CCSE repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv
Data retrieved on 2021-09-09 13:25:06 || Range of dates on data: 2020-01-22--2021-09-08 | Nbr of records: 3342
--------------------------------------------------------------------------------
Data being read from JHU/CCSE repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv
Data retrieved on 2021-09-09 13:25:08 || Range of dates on data: 2020-01-22--2021-09-08 | Nbr of records: 3342
--------------------------------------------------------------------------------
Use the across() function to select a range of variables
# Operation by column, within each group
# Create a new Table, CV_US_Aug, that contains the by-state sum of daily confirmed cases from 2021-08-01 to 2021-08-31
(CV_US_Aug <- CV_US %>%
group_by(Province_State) %>%
summarise(across(`2021-08-01`:`2021-08-31`, sum))
)
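A scaled-down sketch of the same pattern, using a hypothetical tibble toy_us with three date columns, makes it easier to verify by hand what summarise(across(...)) returns: one row per state, with each date column summed over that state's rows.

```r
library(dplyr)

# Hypothetical county-level rows: state X has two counties, state Y has one.
toy_us <- tibble(
  Province_State = c("X", "X", "Y"),
  `2021-08-01`   = c(1, 2, 10),
  `2021-08-02`   = c(3, 4, 20),
  `2021-08-03`   = c(5, 6, 30)
)

# Sum every column in the date range, separately for each state.
toy_by_state <- toy_us %>%
  group_by(Province_State) %>%
  summarise(across(`2021-08-01`:`2021-08-03`, sum))

toy_by_state
```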
Individual exercise (3 mins): use rowSums() to sum by row.
# Find the monthly sum of confirmed cases for each US state in August, 2021.
# Hint-1: rowSums(across(`date-1`:`date-2`), na.rm = TRUE) will return the by-row sum of values in a tibble.
CV_US_Aug %>%
group_by(Province_State) %>%
# na.rm is an argument of rowSums(), so it goes outside the across() call
summarise(Sum_Aug_2021 = rowSums(across(`2021-08-01`:`2021-08-31`), na.rm = TRUE))
# Your code goes here
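Where na.rm is placed matters: it is an argument of rowSums(), so it belongs outside the across() call, as written in Hint-1. The sketch below uses a hypothetical tibble toy_wide with one missing value to show the by-row sum. It uses mutate() rather than summarise(), which is an alternative that works when the table already has one row per state.

```r
library(dplyr)

# Hypothetical one-row-per-state table with a missing value for state X.
toy_wide <- tibble(
  Province_State = c("X", "Y"),
  `2021-08-01`   = c(1, 10),
  `2021-08-02`   = c(NA, 20),
  `2021-08-03`   = c(5, 30)
)

# across() selects the date columns; rowSums() adds them up row by row
# and skips the NA because na.rm = TRUE is passed to rowSums() itself.
toy_month <- toy_wide %>%
  mutate(Sum_Aug = rowSums(across(`2021-08-01`:`2021-08-03`), na.rm = TRUE))

toy_month
```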
Group challenge*: Identify the top ten states that have the highest mean growth rate of COVID cases in August, 2021.
Hint-1: across() allows arithmetic operations for a range of variables.
Hint-2: rowMeans() will return the by-row average of values in a tibble.
CV_US %>%
group_by(Province_State) %>%
# na.rm is an argument of rowSums(), so it goes outside the across() call
summarise(Sum_Aug_2021 = rowSums(across(`2021-08-01`:`2021-08-31`), na.rm = TRUE))
`summarise()` has grouped output by 'Province_State'. You can override using the `.groups` argument.
mutate()
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "NULL"
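One possible way to finish the group challenge is sketched below on a hypothetical three-day tibble, toy_us, rather than the real CV_US data. It assumes the date columns hold cumulative counts and takes the "mean growth rate" to be the average of the day-over-day rates (later - earlier) / earlier. This is one reading of the task, not necessarily the intended solution.

```r
library(dplyr)

# Hypothetical county-level cumulative counts (state X has two counties).
toy_us <- tibble(
  Province_State = c("X", "X", "Y", "Z"),
  `2021-08-01`   = c(40, 60, 200, 50),
  `2021-08-02`   = c(45, 65, 200, 100),
  `2021-08-03`   = c(50, 71, 220, 100)
)

# Step 1: one row per state, each date column summed over counties.
state_totals <- toy_us %>%
  group_by(Province_State) %>%
  summarise(across(`2021-08-01`:`2021-08-03`, sum))

# Step 2: line up each day with the previous day as two matrices.
later   <- as.matrix(select(state_totals, `2021-08-02`:`2021-08-03`))
earlier <- as.matrix(select(state_totals, `2021-08-01`:`2021-08-02`))

# Step 3: average the daily growth rates by row and keep the top states.
top_states <- state_totals %>%
  transmute(
    Province_State,
    mean_growth = rowMeans((later - earlier) / earlier, na.rm = TRUE)
  ) %>%
  slice_max(mean_growth, n = 2)

top_states
```

For the real data the date range would run from 2021-08-01 to 2021-08-31 and n would be 10. States with a zero count on some day would need separate handling, since na.rm = TRUE does not remove infinite values.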